Abstract. We address the problem of retrieving information from a noisy version of 
the "knowledge networks" introduced by Maslov and Zhang [1]. We map this problem 
onto a disordered statistical mechanics model, which opens the door to many analytical 
and numerical approaches. We give the replica symmetric solution, compare with 
numerical simulations, and finally discuss an application to real data from the United 
States Senate. 



Keywords: Communication, supply and information networks; Random graphs, 
networks; Message-passing algorithms. 



Retrieving information from a noisy "knowledge network"






1. Introduction 

In a recent paper [1], Maslov and Zhang addressed the following problem: we are given 
$N$ agents, each one represented by an $M$-dimensional real vector $\vec{r}_i$; suppose we know $K$ 
of the $N(N-1)/2$ scalar products $\Omega_{ij} = \vec{r}_i \cdot \vec{r}_j$ with $i \neq j$. In this situation, can we predict 
the value of an unknown scalar product $\Omega_{ij}$? This question is relevant, for instance, to 
the problem of extracting information from the vast amount of data generated by a 
commercial website. In that context, $\vec{r}_i$ may represent the interests of a person $i$, 
and $\Omega_{ij}$ the mutual appreciation of persons $i$ and $j$; the problem is then to predict the 
mutual appreciation of two persons who do not know each other. Maslov and Zhang 
called the network of interactions and overlaps $\Omega_{ij}$ a "knowledge network"‡. 

One of their main results is the following: there exists a critical density of known 
overlaps $p_c = 2K/N(N-1)$ above which almost all the a priori unknown overlaps are 
completely determined by the $K$ known ones. This transition is a realization of the 
so-called rigidity percolation. However, their treatment leaves several important issues 
aside, and assumes that we have at our disposal much more information than we typically 
do. For instance, the size $M$ of the vectors describing each agent is a priori unknown; 
the problem of estimating $M$ from the data was addressed in [2]. More drastically, the 
data on the overlaps is necessarily noisy: if $\vec{r}_i$ and $\vec{r}_j$ model the interests of persons $i$ and 
$j$, their mutual appreciation $\Omega_{ij}$ is certainly not completely determined by the overlap 
of their interests $\vec{r}_i \cdot \vec{r}_j$, although it is probably biased by it. In this more realistic case of 
noisy information, the questions are: does the "phase transition" noted by Maslov and 
Zhang survive? And how can we retrieve the information contained in the noisy knowledge 
network? We address these issues in the following by studying a simple model of this 
situation. 

The outline of the paper is as follows: in section 2 we present the details of the 
model we are going to study, and the mapping onto a disordered statistical mechanics 
problem, which happens to be the one studied in [3] and more recently in [4]. This 
mapping opens the door to the use of many analytical and numerical methods. In 
section 3 we give the solution of this problem at the replica symmetric level, using the 
cavity method [5]. We then check these analytical results against numerical simulations 
in section 4, and against real data from the United States Senate in section 5. 

‡ These authors actually introduce a bipartite version of these networks. 

2. The model 

We now present the noisy version of Maslov and Zhang's "knowledge network" which 
we are going to study; for simplicity, the variable describing each agent is discrete and 
one dimensional. We consider $N$ agents; each one is characterized by an opinion $s_i^0$, 
with $i = 1, \ldots, N$; the $s_i^0$ may take $k$ different values, and are a priori unknown. The 
$s_i^0$ may be for instance political opinions, as in the example of section 5. We suppose 
we have some information on the $s_i^0$, given by an analog of the "overlaps" of [1]: for 
a certain number of pairs $(i,j)$ we know a number $J_{ij}$ associated with it, constructed as 
follows. If $s_i^0 = s_j^0$, then $J_{ij} = 1$ with probability $1-p$, and $J_{ij} = -1$ with probability 
$p$; if $s_i^0 \neq s_j^0$, then $J_{ij} = 1$ with probability $p$, and $J_{ij} = -1$ with probability $1-p$. We 
take $p < 1/2$. $p$ is then a measure of the noise in the information; in the limit $p = 1/2$, 
the network does not convey any information on the $s_i^0$. The basic questions we ask 
are: how well can we reconstruct the actual opinions $s_i^0$ knowing the $J_{ij}$? Do we have 
an effective algorithm to do so? 

We are interested in the probability of any set of opinions $\{s_i\}_{i=1,\ldots,N}$, given the 
$\{J_{ij}\}$ representing our knowledge; from Bayes formula, we can write: 

$$P(\{s_i\}|\{J_{ij}\}) = \frac{P(\{J_{ij}\}|\{s_i\}) \, P(\{s_i\})}{P(\{J_{ij}\})} . \tag{1}$$

The factor $P(\{s_i\})$ is the prior probability on the $s_i$; we suppose from now on that 
it is flat, so that this term is independent of the $s_i$. It would be possible, however, to 
consider another prior probability. The factor $P(\{J_{ij}\})$ is difficult to compute, as the 
$J_{ij}$ are correlated in an intricate way; however, it is in any case independent of the $s_i$, 
so it acts as a normalization factor for the distribution (1). Finally, $P(\{J_{ij}\}|\{s_i\})$ 
is easy to compute, since once the $s_i$ are given, the $J_{ij}$ are independent. Let us consider 
two agents 1 and 2 with opinions $s_1$ and $s_2$; then from simple algebra one checks that 

$$P(J_{12}|s_1,s_2) = \sqrt{p(1-p)} \left( \sqrt{\frac{1-p}{p}} \right)^{J_{12}(2\delta_{s_1,s_2}-1)} . \tag{2}$$
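The "simple algebra" behind Eq. (2) can be checked numerically. The following minimal Python sketch compares the closed form $\sqrt{p(1-p)}\,(\sqrt{(1-p)/p})^{J_{12}(2\delta_{s_1,s_2}-1)}$ with the four probabilities of the noise model of section 2; the function names are illustrative and do not come from the original work.

```python
import math

def likelihood_closed_form(J, s1, s2, p):
    """Closed form: sqrt(p(1-p)) * sqrt((1-p)/p) ** (J * (2*delta(s1,s2) - 1))."""
    delta = 1 if s1 == s2 else 0
    return math.sqrt(p * (1 - p)) * math.sqrt((1 - p) / p) ** (J * (2 * delta - 1))

def likelihood_from_definition(J, s1, s2, p):
    """Noise model: J reflects the true relation of the opinions with probability 1-p."""
    agrees = (J == 1) == (s1 == s2)
    return 1 - p if agrees else p

p = 0.2
for J in (1, -1):
    for s1, s2 in ((0, 0), (0, 1)):  # equal opinions, then different opinions
        a = likelihood_closed_form(J, s1, s2, p)
        b = likelihood_from_definition(J, s1, s2, p)
        assert abs(a - b) < 1e-12
```

The opinions are only compared for equality, so the check holds for any number $k$ of opinion values, not just the Ising case.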

Since the $J_{ij}$ are independent once the $s_i$ are given, Eq. (1) may be rewritten as 

$$P(\{s_i\}|\{J_{ij}\}) \propto \prod_{<i,j>} \left( \sqrt{\frac{1-p}{p}} \right)^{J_{ij}(2\delta_{s_i,s_j}-1)} , \tag{3}$$

where the index $<i,j>$ means that the product runs over the pairs that are 
connected by a known $J_{ij}$. Taking the logarithm, we have: 

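$$H[\{s_i\}] = -\mathrm{Log}\left[P(\{s_i\}|\{J_{ij}\})\right] = \mathrm{Cste} - B \sum_{<i,j>} J_{ij}(2\delta_{s_i,s_j}-1) , \tag{4}$$

with 

$$B = \frac{1}{2}\,\mathrm{Log}\left(\frac{1-p}{p}\right) .$$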

Eq. (4) can be seen as the Hamiltonian of a disordered Potts model, which opens 
the door to the use of many analytical and numerical tools to study it. From now on, 
we will concentrate for simplicity on the Ising case, where each agent may have only 
two opinions, $s_i = +1$ or $-1$. In this Ising case, the Hamiltonian reads: 

$$H[\{s_i\}] = -\mathrm{Log}\left[P(\{s_i\}|\{J_{ij}\})\right] = \mathrm{Cste} - B \sum_{<i,j>} J_{ij} s_i s_j . \tag{5}$$

The sets $\{s_i\}$ with maximum probability are the minimizers of Eq. (5); the minimizer 
is not necessarily unique. The question of how well we can reconstruct the real opinions 
knowing the $J_{ij}$ is then rephrased as: given a minimizer $\{s_i^*\}$ of Eq. (5), how far is it 
from the real opinions $\{s_i^0\}$? We answer this question in the next section. We note that 
this rephrasing of the problem bears some resemblance to the community detection, 
or clustering, problem as stated in [6]; in that work, however, the probabilistic analysis 
yields a Potts-like model without disorder. 
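To make the notion of a minimizer concrete, here is a minimal Python sketch that enumerates all configurations of a very small Ising instance (with $B = 1$ and the constant dropped). The toy couplings are made up for illustration: they correspond to true opinions $(1, 1, -1, -1)$ on a complete graph of four agents, with the single link $(2,3)$ corrupted by noise. Exhaustive enumeration is only an illustration and is not the method used in this paper.

```python
from itertools import product

def energy(spins, couplings):
    """H = -sum over known pairs of J_ij * s_i * s_j (B = 1, constant dropped)."""
    return -sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())

def brute_force_minimizers(n, couplings):
    """Enumerate all 2^n configurations; feasible only for very small n."""
    best, argmin = None, []
    for spins in product((1, -1), repeat=n):
        e = energy(spins, couplings)
        if best is None or e < best:
            best, argmin = e, [spins]
        elif e == best:
            argmin.append(spins)
    return best, argmin

# Hypothetical toy instance: true opinions (1, 1, -1, -1), link (2, 3) is noisy.
couplings = {(0, 1): 1, (1, 2): -1, (0, 2): -1,
             (2, 3): -1, (0, 3): -1, (1, 3): -1}
e_min, minimizers = brute_force_minimizers(4, couplings)
```

On this instance the minimum is reached by the true opinions and by their global flip: the Hamiltonian is invariant under $s_i \to -s_i$ for all $i$, so the opinions can be recovered only up to an overall relabeling of the two groups.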

3. Cavity solution 

3.1. Gauge transformation 

Hamiltonian (5) is not as well-suited for analytical treatment as it seems to be. It is 
a disordered Ising model, but the probability distribution of the couplings $J_{ij}$ is not 
known, and is actually very complicated: the relevant information we want to extract is 
precisely hidden in the correlations between the $J_{ij}$. The following gauge transformation, 
somewhat miraculously, yields a tractable problem. 

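We define $\tilde{s}_i = s_i^0 s_i$ and $\tilde{J}_{ij} = s_i^0 s_j^0 J_{ij}$ (the $s_i^0$ are the true opinions of the agents); 
then 

$$H[\{\tilde{s}_i\}] = \mathrm{Cste} - B \sum_{<i,j>} \tilde{J}_{ij} \tilde{s}_i \tilde{s}_j . \tag{6}$$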

The distribution of the $\tilde{J}_{ij}$ no longer depends on the $s_i^0$: $\tilde{J}_{ij} = 1$ with probability 
$1-p$ and $\tilde{J}_{ij} = -1$ with probability $p$; all correlations in the couplings have disappeared. 
Furthermore, given a set $\{\tilde{s}_i\}$, it is easy to know how far the corresponding set $\{s_i\}$ is 
from the original $\{s_i^0\}$: it is enough to count the number of $\tilde{s}_i$ equal to $-1$. Thus we 
are left with the study of Hamiltonian (6), which is that of a ferromagnetically biased 
Ising spin glass. We would like to compute the magnetization of the ground state of 
such a Hamiltonian. From now on, we remove the ~ on the $J$'s and the $s$'s. Let us note 
that the ground state does not depend on $B$, so we may take $B = 1$ for simplicity (as 
long as $B > 0$, that is $p < 1/2$). All explicit dependence on $p$ is then removed, which 
is very convenient for practical purposes, as $p$ is a priori unknown: the knowledge of 
the $J_{ij}$ is sufficient to determine the minimizers of the Hamiltonian. We need however to keep $p$ as a 
parameter in the theoretical analysis, and will turn later to the issue of estimating it. 
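The gauge transformation can be illustrated with a short Python sketch. It checks, on a randomly drawn instance, that the interaction sum $\sum J_{ij} s_i s_j$ is unchanged when both the spins and the couplings are gauged by the true opinions, and that the number of wrongly assigned agents is the number of gauged spins equal to $-1$. The instance parameters (50 agents, link probability 0.1, $p = 0.2$) and the function names are arbitrary choices made for this illustration.

```python
import random

def gauge_transform(s0, spins, couplings):
    """Return (s_tilde, J_tilde): s~_i = s0_i * s_i and J~_ij = s0_i * s0_j * J_ij."""
    s_tilde = [a * b for a, b in zip(s0, spins)]
    J_tilde = {(i, j): s0[i] * s0[j] * J for (i, j), J in couplings.items()}
    return s_tilde, J_tilde

def interaction_sum(spins, couplings):
    """Sum over known pairs of J_ij * s_i * s_j (so that H = Cste - B * this sum)."""
    return sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())

rng = random.Random(0)
n, p = 50, 0.2
s0 = [rng.choice((1, -1)) for _ in range(n)]            # true opinions
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.1]
couplings = {(i, j): s0[i] * s0[j] * (-1 if rng.random() < p else 1)
             for (i, j) in edges}                        # noisy observations
spins = [rng.choice((1, -1)) for _ in range(n)]         # an arbitrary guess

s_tilde, J_tilde = gauge_transform(s0, spins, couplings)
assert interaction_sum(spins, couplings) == interaction_sum(s_tilde, J_tilde)

# Number of agents whose guessed opinion differs from the true one
n_errors = sum(1 for s in s_tilde if s == -1)
assert n_errors == sum(1 for a, b in zip(s0, spins) if a != b)
```

In the gauged variables each coupling equals $-1$ exactly when the corresponding observation was corrupted, which is why the couplings become independent with bias $1-p$ towards $+1$.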

3.2. Replica symmetric solution 

It turns out that the ferromagnetically biased Ising spin glass given by Hamiltonian (6) 
has been studied recently by Castellani et al. in [4] for fixed connectivity graphs. In 
the present context, it is more natural to consider random graphs of Erdős-Rényi type, 
with a Poissonian distribution of connectivity. However, such a change from fixed to 
Poissonian connectivity usually does not induce any qualitative change in the phase 
diagram. 

Castellani et al. use the cavity method [5] to compute, among other quantities, the 
one we are interested in: the ground state magnetization as a function of the parameter 
$p$. Let us summarize briefly their main results: at low $p$, the ground state is replica 
symmetric and magnetized, the ground state magnetization approaching 1 when $p$ goes 
to 0; at some critical $p = p_{rsb}$, the replica symmetry is broken, but the ground state 
is still magnetized; finally, for $p > p_c$, the ground state loses its magnetization. When 
the connectivity of the graph increases, this picture is unchanged, but the values of $p_{rsb}$ 
and $p_c$ increase. 

Figure 1. Left: schematic representation of an iteration, leading to Eqs. (9). The 
fields $u_1, \ldots, u_k$ are all equal to 0, 1 or $-1$; $k_0$, $k_+$ and $k_-$ are respectively the number 
of fields equal to 0, 1 and $-1$. Right: schematic representation of the addition of a link 
with coupling $J$, leading to Eq. (12). 

We now give the replica symmetric solution of (6) for an Erdős-Rényi random 
graph, with a Poissonian connectivity distribution $\pi(k)$ of mean $\gamma$. 

The calculations closely follow those of [4] for fixed connectivity. The cavity messages 
$u$ sent by the sites along the links take only the values $+1$, $-1$ and $0$. At the replica 
symmetric level, the system is then described by a single probability distribution: 

$$\mathcal{P}(u) = q_+ \delta(u-1) + q_0 \delta(u) + q_- \delta(u+1) . \tag{7}$$

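We write a recursion relation for the probability distribution $\mathcal{P}$ as follows: 

$$\mathcal{P}(u) = \sum_{k=0}^{\infty} e^{-\gamma} \frac{\gamma^k}{k!} \, E_J \int \prod_{i=1}^{k} \mathcal{P}(u_i)\, du_i \; \delta\left[ u - \mathrm{sgn}\left( J \sum_{i=1}^{k} u_i \right) \right] , \tag{8}$$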

where sgn is the sign function, taken to be zero when the argument is zero; $E_J$ means 
"expectation" over the coupling $J$. Eq. (8) straightforwardly translates into three fixed 
point equations for $q_0$, $q_+$ and $q_-$ (see Fig. 1 for an explanation of $k_+$, $k_-$ and $k_0$): 

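$$q_0 = \sum_{k=0}^{\infty} e^{-\gamma} \gamma^k \sum_{\substack{k_0+k_++k_-=k \\ k_+ = k_-}} \frac{q_+^{k_+} q_-^{k_-} q_0^{k_0}}{k_+!\, k_-!\, k_0!}$$

$$q_+ = \sum_{k=0}^{\infty} e^{-\gamma} \gamma^k \left[ (1-p) \sum_{\substack{k_0+k_++k_-=k \\ k_+ > k_-}} \frac{q_+^{k_+} q_-^{k_-} q_0^{k_0}}{k_+!\, k_-!\, k_0!} + p \sum_{\substack{k_0+k_++k_-=k \\ k_+ < k_-}} \frac{q_+^{k_+} q_-^{k_-} q_0^{k_0}}{k_+!\, k_-!\, k_0!} \right]$$

$$q_- = 1 - q_0 - q_+ . \tag{9}$$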
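Once $q_0$, $q_+$ and $q_-$ are known, the ground state magnetization is given by the expression: 

$$m = \sum_{k=0}^{\infty} e^{-\gamma} \gamma^k \left[ \sum_{\substack{k_0+k_++k_-=k \\ k_+ > k_-}} \frac{q_+^{k_+} q_-^{k_-} q_0^{k_0}}{k_+!\, k_-!\, k_0!} - \sum_{\substack{k_0+k_++k_-=k \\ k_+ < k_-}} \frac{q_+^{k_+} q_-^{k_-} q_0^{k_0}}{k_+!\, k_-!\, k_0!} \right] . \tag{10}$$

To compute the ground state energy, one computes the energy shifts $\Delta E_s$ due to the 
addition of a site, and $\Delta E_l$ due to the addition of a link. One gets after straightforward 
calculations: 

$$\Delta E_s = \sum_{k=0}^{\infty} e^{-\gamma} \gamma^k \sum_{k_0+k_++k_-=k} \frac{q_+^{k_+} q_-^{k_-} q_0^{k_0}}{k_+!\, k_-!\, k_0!} \left( -k_0 - |k_+ - k_-| \right) \tag{11}$$

$$\Delta E_l = -\frac{(q_+ - q_-)^2}{1-2p} - 2 q_0 + q_0^2 . \tag{12}$$

The ground state energy $e_{gs}$ is then given by 

$$e_{gs} = \Delta E_s - \frac{\gamma}{2} \Delta E_l . \tag{13}$$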

The qualitative picture emerging from this replica symmetric analysis is the 
following: for each mean connectivity $\gamma > 1$, there is a critical value $p_c(\gamma)$ such that 
for $p < p_c(\gamma)$, it is possible to extract information from the knowledge network. The 
error rate $\epsilon$ in the $N \to \infty$ limit is directly related to the ground state magnetization 
$m$: 

$$\epsilon(p,\gamma) = \frac{1 - m(p,\gamma)}{2} .$$

For $p > p_c(\gamma)$, it is no longer possible to extract meaningful information from the 
data in the limit $N \to \infty$: the error rate tends to 1/2. 
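The replica symmetric fixed point can be explored numerically. The Python sketch below is one possible way to do so, not necessarily the procedure used for the figures of this paper. It relies on an observation that is not spelled out in the text: since $q_0 + q_+ + q_- = 1$, the weights $e^{-\gamma}\gamma^k q_+^{k_+} q_-^{k_-} q_0^{k_0}/(k_+!\,k_-!\,k_0!)$ factorize into three independent Poisson distributions of means $\gamma q_+$, $\gamma q_-$ and $\gamma q_0$, so that the sums in Eqs. (9) and (10) reduce to the probabilities that $k_+$ is larger than, smaller than, or equal to $k_-$. The truncation `kmax`, the initial guess and the plain iteration scheme are assumptions of this sketch.

```python
import math

def poisson_pmf(lam, kmax):
    """Poisson probabilities for k = 0..kmax (truncated tail is neglected)."""
    pmf = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf

def prob_first_greater(pa, pb):
    """P(a > b) for independent integer variables with pmfs pa and pb."""
    total, cum_b = 0.0, 0.0
    for a_prob, b_prob in zip(pa, pb):
        total += a_prob * cum_b   # cum_b = P(b < current value of a)
        cum_b += b_prob
    return total

def sign_probs(q_plus, q_minus, gamma, kmax):
    """Return P(k+ > k-), P(k+ < k-), P(k+ = k-) for Poisson k+ and k-."""
    pp = poisson_pmf(gamma * q_plus, kmax)
    pm = poisson_pmf(gamma * q_minus, kmax)
    equal = sum(a * b for a, b in zip(pp, pm))
    return prob_first_greater(pp, pm), prob_first_greater(pm, pp), equal

def solve_replica_symmetric(p, gamma, kmax=80, max_iters=5000, tol=1e-12):
    """Iterate the fixed point equations from a magnetized initial guess."""
    q_plus, q_minus = 0.9, 0.05
    for _ in range(max_iters):
        greater, less, _ = sign_probs(q_plus, q_minus, gamma, kmax)
        new_plus = (1 - p) * greater + p * less
        new_minus = p * greater + (1 - p) * less
        change = abs(new_plus - q_plus) + abs(new_minus - q_minus)
        q_plus, q_minus = new_plus, new_minus
        if change < tol:
            break
    greater, less, equal = sign_probs(q_plus, q_minus, gamma, kmax)
    m = greater - less                      # ground state magnetization
    return {"q_plus": q_plus, "q_minus": q_minus, "q_zero": equal,
            "m": m, "error_rate": (1 - m) / 2}

result = solve_replica_symmetric(p=0.1, gamma=3.0)
```

Two limits give a sanity check of the sketch. At $p = 0$ the magnetization satisfies $m = 1 - e^{-\gamma m}$, the equation for the giant component of the random graph: agents outside it cannot be oriented relative to the majority. At $p = 1/2$ the iteration gives $m = 0$, that is an error rate of 1/2.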



3.3. Discussion 

We compare these replica symmetric analytic results to numerical simulations in the next 
section. We can however make some a priori remarks on the validity of the calculation. 
First, we expect the calculations to be exact at small enough $p$; we then expect a 
replica symmetry breaking transition at some $p_{rsb}(\gamma) < p_c(\gamma)$. For $p > p_{rsb}(\gamma)$, the 
replica symmetric results are no longer reliable. We expect that the phase transition 
described above towards a non magnetized ground state is shifted to some $p_c^{RSB} \neq p_c^{RS}$. 
However, the qualitative result of a transition between one phase which contains some 
information and another one which does not should still hold true. 

Another word of caution is in order: the authors of [4] note strong finite size effects 
for a fixed connectivity network; this is likely to be the case also for a Poissonian network, 
and it may smear out somewhat the transition for finite $N$. 



3.4. Estimating p 

As already noted above, Eq. (6) only depends on $p$ through the parameter $B$, so 
an a priori knowledge of $p$ is not necessary to carry out the minimization. This is 
an interesting practical advantage. However, the number of errors contained in the 
minimizer strongly depends on $p$, as explained above. So it would be useful to have 
some information about the value of $p$, to get an estimate of the number of errors 
contained in the ground state. It is indeed possible in some cases to estimate $p$ from the 
only available data, the $J_{ij}$'s. Suppose we are given a network. It is possible to compute 
for this network $e_{gs}(p)$, the ground state energy as a function of $p$, by drawing each 
$J_{ij}$ at random, equal to $-1$ with probability $p$ and to $+1$ otherwise; this can be done 
analytically in some cases with the cavity method, or numerically. Then one computes 
the ground state energy of the network with the real $J_{ij}$'s from the data; comparing with 
$e_{gs}(p)$, one gets an estimate of $p$, provided the $e_{gs}(p)$ curve is not flat. 
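The comparison step can be sketched as a simple inversion of the energy curve. The Python function below assumes that $e_{gs}(p)$ is available as a callable and that it increases with $p$ on the interval considered; both are assumptions of this sketch. The linear `toy_curve` is a made-up placeholder, not a result of this paper.

```python
def estimate_p(energy_curve, e_observed, lo=0.0, hi=0.5, tol=1e-6):
    """Solve energy_curve(p) = e_observed by bisection on [lo, hi].

    Assumes energy_curve is increasing on [lo, hi]; a flat curve carries
    no information on p and cannot be inverted.
    """
    if not energy_curve(lo) <= e_observed <= energy_curve(hi):
        raise ValueError("observed energy is outside the range of the curve")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if energy_curve(mid) < e_observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def toy_curve(p):
    """Hypothetical placeholder for e_gs(p); NOT taken from the paper."""
    return -1.5 + 1.2 * p

p_hat = estimate_p(toy_curve, e_observed=-1.26)   # close to 0.2 for this toy curve
```

In practice the callable would be replaced by the cavity prediction or by a numerically tabulated curve for the network at hand.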

4. Numerical simulations 

We now compare the analytical prediction of the previous section to data generated 
randomly: we randomly assign a value $s_i^0 = 1$ or $s_i^0 = -1$ to $N$ spins; we randomly 
draw a network connecting these spins, and randomly assign a value 1 or $-1$ to each 
link $J_{ij}$ connecting spins $i$ and $j$, following the rule: 

$J_{ij} = s_i^0 s_j^0$ with probability $1-p$, 

$J_{ij} = -s_i^0 s_j^0$ with probability $p$. 
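The generation of such a synthetic instance can be sketched in a few lines of Python. The sketch assumes an Erdős-Rényi graph in which each pair is linked independently with probability $\gamma/(N-1)$, which gives a mean connectivity $\gamma$; the function name, the seed handling and the quadratic loop over pairs are illustrative choices, suitable for moderate $N$ only.

```python
import random

def generate_instance(n, gamma, p, seed=None):
    """Draw true opinions, a random graph of mean degree gamma, and noisy links.

    Returns (s0, couplings), where couplings maps pairs (i, j) with i < j
    to J_ij = +1 or -1.
    """
    rng = random.Random(seed)
    s0 = [rng.choice((1, -1)) for _ in range(n)]
    link_prob = gamma / (n - 1)
    couplings = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < link_prob:
                noise = -1 if rng.random() < p else 1   # flipped with probability p
                couplings[(i, j)] = noise * s0[i] * s0[j]
    return s0, couplings

s0, couplings = generate_instance(n=200, gamma=3.0, p=0.1, seed=1)
```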

We then numerically minimize the corresponding Hamiltonian. For this purpose, we 
may use simulated annealing. It is simple to program, but not very fast, and does 
not perform well in the replica symmetry broken phase. However, the structure of the 
problem suggests using another class of algorithms, intensively studied in different 
contexts recently (see for instance [9] for a pedagogical introduction in the context 
of error correcting codes): Belief Propagation (BP). BP is not expected to perform 
better than simulated annealing in the replica symmetry broken phase, and it may 
sometimes fail to converge. However, it performs very well overall, and is much faster 
than simulated annealing, which makes it possible to reach higher $N$: this is crucial to deal with 
large data sets. 
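The text does not detail the BP implementation, so the following Python sketch is only one standard possibility: sum-product BP for a pairwise Ising model at inverse temperature `beta`, with parallel updates of the messages. The external field that pins one agent is an assumption of this sketch, introduced to select one of the two mirror-image solutions; the values of `beta`, `pin_field` and the stopping rule are arbitrary.

```python
import math

def belief_propagation(n, couplings, beta=2.0, pinned=0, pin_field=1.0,
                       max_iters=500, tol=1e-10):
    """BP for H = -sum J_ij s_i s_j; returns the magnetization of each site.

    couplings maps pairs (i, j) to J_ij = +1 or -1. A field on site `pinned`
    (an assumption of this sketch) breaks the global flip symmetry.
    """
    neighbors = {i: {} for i in range(n)}
    for (i, j), J in couplings.items():
        neighbors[i][j] = J
        neighbors[j][i] = J
    ext = [0.0] * n
    ext[pinned] = pin_field
    # u[(k, i)] is the bias sent by site k to its neighbor i
    u = {(k, i): 0.0 for k in neighbors for i in neighbors[k]}
    for _ in range(max_iters):
        new_u = {}
        for (k, i) in u:
            # cavity field acting on k when the link to i is removed
            h = ext[k] + sum(u[(l, k)] for l in neighbors[k] if l != i)
            J = neighbors[k][i]
            new_u[(k, i)] = math.atanh(
                math.tanh(beta * J) * math.tanh(beta * h)) / beta
        change = max((abs(new_u[e] - u[e]) for e in u), default=0.0)
        u = new_u
        if change < tol:
            break
    return [math.tanh(beta * (ext[i] + sum(u[(k, i)] for k in neighbors[i])))
            for i in range(n)]

# Noise-free path 0-1-2-3 with true opinions (1, -1, -1, 1)
truth = [1, -1, -1, 1]
path_couplings = {(0, 1): -1, (1, 2): 1, (2, 3): -1}
mags = belief_propagation(4, path_couplings)
guess = [1 if m > 0 else -1 for m in mags]
```

On this small noise-free example the signs of the magnetizations reproduce the true opinions. On graphs with loops and noisy couplings convergence is not guaranteed, as noted above, and `max_iters` then acts as a cutoff.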

In Fig. 2 one sees that the agreement between simulations using BP and replica 
symmetric calculations is very good for low $p$. For larger $p$, there are important 
discrepancies, which may have two origins. First, one expects replica symmetry 
breaking, as in [5]; this means that the replica symmetric calculation is no longer exact, 
and that BP is not expected to perform well. Second, as already noticed in [4], 
finite size effects are strong. However, the numerical results seem compatible with 
the main analytical finding: the presence of a transition between a low $p$ phase which 
contains information, and a high $p$ one that does not. We also note that the error rate 
obtained with BP is always smaller than the theoretical one estimated from the replica 
symmetric analysis. Finally, it is interesting to compare these results quantitatively 
with those of [4] for regular graphs: both theory and numerics predict a significantly 
higher threshold between the informative and non informative phases for a Poissonian 
network, for a given mean connectivity. 

Figure 2. Energy (left) and error rate (right) as a function of $p$, for a $\gamma = 3$ Poissonian 
random graph. Symbols are from numerical simulations using the BP algorithm, with 
$N = 8000$; the solid and dashed lines are the replica symmetric analytical results. 

BP has another big advantage over simulated annealing: its outcome is a 
magnetization for each site, so we also have an indication of which sites are most likely 
to be wrongly guessed (those with magnetization close to zero). As a final remark, 
it may be possible to improve performance in the replica symmetry broken phase by 
using a survey propagation algorithm [8]. 



5. The US Senate example 

The analytical results of section 3 are strengthened by the numerical simulations of 
section 4; however, unlike the numerical data, no real data set follows exactly 
the probabilistic model underlying our study. It is thus important to assess how robust 
the results are with respect to some uncertainty in the model. In this section, we 
will analyze data from the United States Senate votes, and show that the strategy 
of minimizing Hamiltonian (6) does make it possible to retrieve some information from the data; 
the amount of information retrieved is in reasonable quantitative agreement with the 
predictions of section 3. 

We consider here as agents the 100 US Senators serving in 2001. The party of 
each senator plays the role of the unknown opinion $s_i^0$; say $s_i^0 = -1$ if senator $i$ is 
a Democrat, and $s_i^0 = 1$ if senator $i$ is a Republican. On the US Senate website 
(http://www.senate.gov/), the voting positions of all senators are available for the 
so-called "roll call votes". We expect that senators from the same party tend to cast the 
same vote, and senators from different parties tend to vote differently, although it is of 
course not an absolute rule. 

§ We certainly do not claim that the present method is the best possible to extract information from 
the US Senate data; we only try to test the robustness of our results on a real data set. 

Figure 3. The parameter $p$ is fixed, $p = 0.2$. For each value of $\gamma$, the symbols 
correspond to 100 realizations. The solid line is the analytical replica symmetric result. 

We construct an instance of the "knowledge network problem" as follows: 

• We pick a random network with given parameter $\gamma$, and the senators as nodes. 

• For each edge of the network, linking two senators with labels $i$ and $j$, we pick 
randomly one roll call vote in early 2001|| and consider the voting positions of the 
two senators $i$ and $j$. If they cast the same vote, we set $J_{ij} = 1$; if they cast a 
different vote, we set $J_{ij} = -1$. 
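The two construction steps above can be sketched in Python as follows. The data layout (one dictionary per roll call, mapping a senator to a vote label) and the toy votes are hypothetical, and the sketch assumes that every senator has a recorded position in every roll call; how absences are handled is not specified here.

```python
import random

def couplings_from_votes(edges, votes, rng):
    """Assign J = +1 or -1 to each edge from one randomly chosen roll call.

    edges: list of pairs of senators; votes: list of dicts, one per roll call,
    mapping each senator to a vote label (hypothetical layout).
    """
    couplings = {}
    for (i, j) in edges:
        roll_call = rng.choice(votes)               # one random roll call per link
        couplings[(i, j)] = 1 if roll_call[i] == roll_call[j] else -1
    return couplings

# Toy data, made up for illustration: A and B always agree, C always disagrees.
votes = [
    {"A": "Yea", "B": "Yea", "C": "Nay"},
    {"A": "Nay", "B": "Nay", "C": "Yea"},
]
edges = [("A", "B"), ("A", "C"), ("B", "C")]
J = couplings_from_votes(edges, votes, random.Random(0))
```

Repeating the call with different random networks and different random generators yields the many instances used for each value of the mean connectivity.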

Varying the random network and the random pick of the roll call votes for each link, we 
can generate many different instances of the "knowledge network" for each $\gamma$. 

As senators from the same (resp. different) party tend to cast the same (resp. 
different) vote, they tend to be linked by edges with positive (resp. negative) $J$'s. The 
fact that senators do not always vote like the majority of their colleagues from the 
same party plays the role of a noise. We crudely model this situation as in section 2, 
assuming that $J_{ij} = s_i^0 s_j^0$ with probability $1-p$, and $J_{ij} = -s_i^0 s_j^0$ with probability $p$, 
$p$ being unknown and smaller than 1/2. We now want to retrieve some information about 
the $s_i^0$'s (i.e. the party of each senator), using the method described in this paper. 

Based only on the set of the $J_{ij}$, we run the BP algorithm for each instance of the 
"knowledge network", without using any a priori knowledge of the parameter $p$; we then 
split the Senate into Republicans and Democrats, according to the BP results. We can 
check how many errors we have, and compare with the theory of section 3. Note that 
we can choose the connectivity $\gamma$ of the random network. We have no control however 
on the parameter $p$. 
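Counting the errors requires one precaution, which follows from the symmetry of the Hamiltonian under a global flip of all the spins: the minimization separates the senators into two groups but cannot tell which group is which. A minimal Python sketch of the comparison, with an illustrative function name, is:

```python
def error_rate(guess, truth):
    """Fraction of misclassified agents, minimized over the global flip."""
    n = len(truth)
    mismatches = sum(1 for g, t in zip(guess, truth) if g != t)
    return min(mismatches, n - mismatches) / n

# A guess that is the exact mirror image of the truth counts as error-free.
perfect_up_to_flip = error_rate([-1, -1, 1, 1], [1, 1, -1, -1])
one_mistake = error_rate([1, 1, -1, -1], [1, 1, -1, 1])
```

With this convention the error rate lies between 0 and 1/2, the latter corresponding to a split that carries no information on the parties.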

|| In practice, we have collected the data from 50 roll call votes in early 2001. 

The results are presented in Fig. 3 and compared to the replica symmetric 
analytical calculations. They seem to be consistent with the main qualitative analytical 
result: the existence of a threshold separating a phase containing almost no information 
(low $\gamma$) and a phase which contains some (high $\gamma$). We also see in Fig. 3 that there 
is a strong sample to sample variability; for small error rates however (large values of 
the mean connectivity $\gamma$), the agreement is rather good; for smaller $\gamma$, the agreement 
is poor. There are two explanations for that, besides the fact that the votes are not 
random: replica symmetry is probably broken, and, more importantly for such 
small systems ($N = 100$), finite size effects create a large bias. We note however that the 
practical error rate is usually smaller than the analytical one. 

6. Conclusion 

We have extended the "knowledge network" formalism of [1] to the more realistic case of 
noisy data. We have shown that there is a phase transition between an information-rich 
phase, and a phase that essentially contains no information. In the former situation, 
the information may be efficiently retrieved through a Belief Propagation algorithm. 

There are several possible extensions to this work. The most direct ones are 
the study of non-binary opinions (Potts-like models), or multidimensional opinions. 
With the applications to commercial websites presented in [1, 2] in mind, it would also 
be interesting to consider bipartite networks. For all these cases, it seems that the 
disordered statistical mechanics point of view used in this paper may be fruitful, by 
suggesting the use of some powerful analytical as well as numerical techniques. 